
feat: add EAGLE3 support for Step-3.5-Flash #530

Open
zijiexia wants to merge 10 commits into sgl-project:main from zijiexia:support_step3p5


@zijiexia zijiexia commented Apr 13, 2026

Summary

  • New chat template (specforge/data/template.py): registers step3.5, a thinking-enabled template using <|im_start|> / <|im_end|> tokens, matching Step-3.5-Flash's format.

  • New draft model config (configs/step-3.5-flash-eagle3.json): EAGLE3 architecture config for Step-3.5-Flash: a 1-layer LlamaForCausalLMEagle3 with auxiliary hidden states captured from layers 4, 20, and 40.

  • Training script (examples/run_step3p5_flash_eagle3_online.sh): end-to-end online training script for EAGLE3 on Step-3.5-Flash with SGLang backend, FA3 attention, and W&B logging.

  • smoltalk-chinese dataset (scripts/prepare_data.py): adds process_smoltalk_row and wires up zjxia/smoltalk-chinese as a supported dataset option.

  • Fix sglang_max_total_tokens OOM for SWA models (specforge/args.py): changed target_batch_size * max_length to int(target_batch_size * max_length * 1.2). The 1.2× buffer is driven by three structural properties of SGLang's SWA memory allocator:

    1. Page-alignment overhead: alloc_paged_token_slots_extend over-reserves by batch_size × page_size slots on every extend call (mem_cache/common.py:267).
    2. Dual-pool double-counting: SWA models maintain two independent pools (full_attn and swa_attn), each independently applying the same overhead check (swa_memory_pool.py:370–414).
    3. SWA pool shrinkage: the SWA pool is sized at swa_full_tokens_ratio = 0.8× of the full pool, so it exhausts first. Compensating for shrinkage alone would require 1/0.8 = 1.25×, and factors 1 and 2 add further overhead on top, pushing the worst-case requirement slightly above 1.25×. In practice, page-alignment overhead is small (~0.78% per extend batch at batch=128, page_size=16, max_length=2048), and 1.2× has been empirically sufficient while avoiding unnecessary over-reservation of the token pool.
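
The figures quoted in the list above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not the code in specforge/args.py: the helper names `buffered_max_total_tokens` and `page_alignment_overhead` are hypothetical, and the overhead model (one over-reservation of batch_size × page_size slots per extend call) is taken directly from item 1.

```python
def buffered_max_total_tokens(target_batch_size: int, max_length: int,
                              buffer: float = 1.2) -> int:
    """Token budget after applying the buffer from the PR description."""
    return int(target_batch_size * max_length * buffer)


def page_alignment_overhead(batch_size: int, page_size: int,
                            max_length: int) -> float:
    """Fraction of extra slots reserved by one extend call (assumed model:
    batch_size * page_size extra slots on top of batch_size * max_length)."""
    extra_slots = batch_size * page_size
    return extra_slots / (batch_size * max_length)


# Values quoted in the PR description.
batch, page, length = 128, 16, 2048

base_tokens = batch * length                        # 262144
budget = buffered_max_total_tokens(batch, length)   # 314572
overhead = page_alignment_overhead(batch, page, length)

# Factor needed to offset a 0.8x SWA pool on its own.
shrinkage_compensation = 1 / 0.8

print(f"base={base_tokens} budget={budget}")
print(f"page-alignment overhead per extend: {overhead:.2%}")   # 0.78%
print(f"shrinkage compensation alone: {shrinkage_compensation:.2f}x")
```

The overhead fraction reduces to page_size / max_length, so it shrinks as sequences get longer and grows with larger pages.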

Test plan

  • Run run_step3p5_flash_eagle3_online.sh and confirm training loop starts without OOM
  • Verify smoltalk-chinese dataset processes correctly via prepare_data.py --dataset smoltalk-chinese
  • Confirm step3.5 template tokenizes a sample conversation as expected
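
For the template check, a rough sketch of what a ChatML-style layout built on <|im_start|> / <|im_end|> looks like may help when eyeballing the output. Everything here is an assumption for illustration: `render_chatml` is a hypothetical helper, not the specforge template API, and the exact whitespace and thinking markers of the registered step3.5 template are not shown in this PR description and are not modeled.

```python
from typing import Dict, List

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages: List[Dict[str, str]],
                  add_generation_prompt: bool = False) -> str:
    """Render messages in the generic ChatML layout (assumed, not the
    actual step3.5 template): <|im_start|>role\\ncontent<|im_end|>\\n"""
    parts = []
    for message in messages:
        parts.append(f"{IM_START}{message['role']}\n{message['content']}{IM_END}\n")
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


sample = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there"},
]
text = render_chatml(sample)

# Basic sanity check: every opened turn is closed.
assert text.count(IM_START) == text.count(IM_END) == len(sample)
print(text)
```

Comparing a string like this against the tokenizer's decoded output for the same conversation is one way to spot missing or duplicated turn delimiters.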

🤖 Generated with Claude Code

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@zijiexia
Author

This support also requires changes on the SGLang side; a PR has been raised: sgl-project/sglang#22718
